Unknown Word Identification for Chinese Morphological Analysis ∗

Author

  • Chooi-Ling Goh
Abstract

Because written Chinese does not use spaces to mark word boundaries, segmenting Chinese text is an essential task in Chinese language processing. Besides word segmentation, we also need to identify the part-of-speech (POS) tags of the words. Segmentation and POS tagging together are referred to as morphological analysis. Two main problems occur during word segmentation: segmentation ambiguities and unknown word occurrences. There are basically two types of segmentation ambiguity: covering ambiguity and overlapping ambiguity. Both types concern known words. Unknown words, by contrast, must be detected in the text from their context. In this report, we focus on the unknown word problem and propose several machine-learning-based methods for solving it. POS tagging also faces an ambiguity problem, because a single word can take multiple POS tags and the correct one depends on the context. Furthermore, if a word is unknown, its POS tag must be guessed from the word's components and context. As a result of this research, we have built a practical morphological analyzer that can be freely used by anyone for research purposes. A practical system requires a reasonably sized dictionary. The initial dictionary is built from the Penn Chinese Treebank corpus v4.0 and contains only 33,438 entries. Because this initial dictionary is quite small, the unknown word detection method is applied to a large amount of raw text to extract new words, which are added to the system dictionary. We have successfully constructed a dictionary with 120,769 entries. Finally, we propose a two-layer morphological analysis to cater for two sets of outputs. The first layer produces the minimal segmentation unit ...

∗ Doctoral Dissertation, Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-DD0361217, September 29, 2006.
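The two ambiguity types and the unknown word problem named in the abstract can be illustrated with a small sketch. This is not the dissertation's method: it is a brute-force enumeration over hypothetical toy dictionaries (`overlap_dict`, `cover_dict`) and a hypothetical helper `all_segmentations`, chosen only to show why a dictionary alone cannot decide the segmentation.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Extend each segmentation of the remainder with this head word.
            for tail in all_segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results


# Toy dictionaries (assumptions for illustration only).
overlap_dict = {"研究", "研究生", "生命", "命"}
cover_dict = {"才能", "才", "能"}

# Overlapping ambiguity: 研究生 and 生命 share the character 生.
overlapping = all_segmentations("研究生命", overlap_dict)

# Covering ambiguity: 才能 is a word, and so are its parts 才 and 能.
covering = all_segmentations("才能", cover_dict)

# Unknown word: 所 is not in the toy dictionary, so no segmentation exists.
unknown = all_segmentations("研究所", overlap_dict)

print(overlapping)
print(covering)
print(unknown)
```

In this sketch the first two calls each return two competing segmentations, which a real analyzer has to disambiguate from context, while the third returns an empty list because part of the input is not covered by the dictionary.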

Similar resources

Chinese Unknown Word Identification Using Character-based Tagging and Chunking

Since written Chinese has no spaces to delimit words, segmenting Chinese texts becomes an essential task. During this task, the problem of unknown words arises. It is impossible to register all words in a dictionary, as new words can always be created by combining characters. We propose a unified solution to detect unknown words in Chinese texts. First, a morphological analysis is done to obtain i...

Hybrid Models for Chinese Unknown Word Resolution Dissertation

Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistic...

A Lexicon-Constrained Character Model for Chinese Morphological Analysis

This paper proposes a lexicon-constrained character model that combines both word and character features to solve complicated issues in Chinese morphological analysis. A Chinese character-based model constrained by a lexicon is built to acquire word building rules. Each character in a Chinese sentence is assigned a tag by the proposed model. The word segmentation and part-of-speech tagging resul...

Semantic Classification of Chinese Unknown Words

This paper describes a classifier that assigns semantic thesaurus categories to unknown Chinese words (words not already in the CiLin thesaurus and the Chinese Electronic Dictionary, but in the Sinica Corpus). The focus of the paper differs in two ways from previous research in this particular area. Prior research in Chinese unknown words mostly focused on proper nouns (Lee 1993, Lee, Lee and C...

Morphological features help POS tagging of unknown words across language varieties

Part-of-speech tagging, like any supervised statistical NLP task, is more difficult when test sets are very different from training sets, for example when tagging across genres or language varieties. We examined the problem of POS tagging of different varieties of Mandarin Chinese (PRC-Mainland, PRC-Hong Kong, and Taiwan). An analytic study first showed that unknown words were a major source of ...

Publication date: 2006